First, we will read in the MISR datasets which have been matched to the AQS and CSN datasets. These data were matched spatially by pairing each MISR data pixel with every AQS/CSN data collection site within a 2.2 km radius of it, and these matches were then filtered so that only observations recorded on the same date were kept.
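The matching rule described above can be sketched as follows. This is only an illustration in Python using the standard library: the record layout (`id`, `lat`, `lon`, `date`), the function names, and the use of the haversine great-circle distance are assumptions, as the matching code itself is not shown here.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant for this sketch)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def match_sites(pixels, sites, radius_km=2.2):
    """Pair each pixel with every site within radius_km that shares its date."""
    matches = []
    for p in pixels:
        for s in sites:
            if p["date"] != s["date"]:
                continue  # temporal filter: same recording date only
            if haversine_km(p["lat"], p["lon"], s["lat"], s["lon"]) <= radius_km:
                matches.append((p["id"], s["id"]))
    return matches

# Hypothetical records, invented purely to exercise the function.
pixels = [{"id": "px1", "lat": 34.05, "lon": -118.25, "date": "2005-07-14"}]
sites = [
    {"id": "near", "lat": 34.06, "lon": -118.25, "date": "2005-07-14"},
    {"id": "far", "lat": 34.20, "lon": -118.25, "date": "2005-07-14"},
    {"id": "wrong_day", "lat": 34.06, "lon": -118.25, "date": "2005-07-15"},
]
matched = match_sites(pixels, sites)
```

Only the `near` site survives both filters in this toy example. The nested loop is fine for a sketch, but a spatial index would be a more sensible choice for a dataset of this size.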
We will also slightly alter these datasets by changing the way that dates are stored. Instead of storing each date as a single object in YYYY-MM-DD format, we will store the day, month, and year as three separate attributes.
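As a small illustration of this change, the sketch below splits one YYYY-MM-DD string into three attributes. The `Date` key and the helper name are hypothetical; `Year`, `Month`, `Day`, and `PM25` follow the variable names used in the missing-value table later in this section.

```python
from datetime import date

def split_date(iso_string):
    """Split a YYYY-MM-DD string into separate Year, Month, and Day attributes."""
    d = date.fromisoformat(iso_string)
    return {"Year": d.year, "Month": d.month, "Day": d.day}

# Hypothetical row with a single date field.
row = {"Date": "2005-07-14", "PM25": 8.1}
row.update(split_date(row.pop("Date")))
```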
In addition to the data which we collected from the CSN dataset, we will also use a formula to estimate the total dust mass in a given area, based on the presence of certain elements.
The formula to compute dust mass is given by \(\text{Dust Mass} = 2.2\times\text{Al} + 2.49\times\text{Si} + 1.63\times\text{Ca} + 1.94\times\text{Ti} + 2.42\times\text{Fe}\).
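The formula above translates directly into code. The following is a minimal sketch; the dictionary layout and the choice to return `None` when any element is missing are assumptions made for illustration.

```python
# Coefficients taken from the dust mass formula above.
DUST_COEFFICIENTS = {"Al": 2.2, "Si": 2.49, "Ca": 1.63, "Ti": 1.94, "Fe": 2.42}

def dust_mass(sample):
    """Estimate dust mass from elemental concentrations.

    Returns None if any required element is missing (an assumed convention).
    """
    if any(sample.get(element) is None for element in DUST_COEFFICIENTS):
        return None
    return sum(coef * sample[element] for element, coef in DUST_COEFFICIENTS.items())

# Hypothetical concentrations, invented for illustration.
example = {"Al": 0.1, "Si": 0.2, "Ca": 0.05, "Ti": 0.01, "Fe": 0.1}
estimate = dust_mass(example)  # 0.22 + 0.498 + 0.0815 + 0.0194 + 0.242
```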
First, we will do some exploratory data analysis for these datasets, so we can have a better understanding of the data which we collected.
We will examine some brief numerical summaries of our main four variables, in order to know more about their general distributions.
| Variable | Minimum | 25th percentile | Median | 75th percentile | Maximum | IQR | Range | Mean | Standard Deviation | Present Values | Missing Values |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Dust Mass | -0.0284 | 0.3503 | 0.6126 | 1.0773 | 18.3301 | 0.7270 | 18.3584 | 0.9065 | 1.0361 | 5115 | 174 |
| Nitrate | 0.0000 | 0.6300 | 1.4000 | 3.4400 | 53.9000 | 2.8100 | 53.9000 | 3.0576 | 4.6030 | 5073 | 216 |
| Sulfate | 0.0000 | 0.5640 | 0.9943 | 1.6800 | 10.7000 | 1.1160 | 10.7000 | 1.3069 | 1.1535 | 5094 | 195 |
| PM2.5 | -7.2000 | 5.0000 | 8.0833 | 12.4583 | 529.4167 | 7.4583 | 536.6167 | 10.4596 | 10.6680 | 157005 | 0 |
Based on the table above, we see that negative values were recorded for both Dust Mass (minimum -0.0284) and PM2.5 (minimum -7.2000). As a concentration cannot be negative (we cannot have a negative amount of a particle), we will replace all negative values with 0.
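A minimal sketch of this replacement is shown below. Representing missing values as `None` and leaving them untouched is an assumption; the text only specifies that negative values become 0.

```python
def clamp_negative(values):
    """Replace negative values with 0.0, leaving missing values (None) as they are."""
    return [v if v is None else max(v, 0.0) for v in values]

# Hypothetical column containing one negative and one missing value.
cleaned = clamp_negative([-7.2, 5.0, None, 8.0833])
```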
In addition to examining numerical summaries of these values, we will also examine histograms of the values to see the overall distributions of these variables in a more visual manner.
From the plots above, we see that the distributions of all four variables are right-skewed. There are quite a few high-valued outliers in these datasets, but no corresponding low values to balance them, as these data are all non-negative.
The log-plots above all appear to be relatively symmetrical and look somewhat like Normal distributions, which may be helpful for model fitting and prediction purposes, as these distributions are significantly less skewed by their few large values.
In addition to the histograms, which show the general distributions of these data over our 22-year period, we have created time series plots of dust mass, nitrate, PM2.5, and sulfate concentrations in California to visualize how these quantities have changed over time.
The PM2.5 data is sourced from the AQS data collection sites, whereas the dust mass, nitrate, and sulfate concentrations come from the CSN datasets.
Monthly PM2.5 Concentrations in California
Monthly Dust Mass Concentrations in California
To start off, we will examine counts of missing values in our datasets, to determine how much of the data which we aim to use is actually present in the dataset.
| Variable Name | Recorded Values | Missing Values |
|---|---|---|
| PM25 | 157005 | 0 |
| Year | 157005 | 0 |
| Month | 157005 | 0 |
| Day | 157005 | 0 |
| AOD | 50089 | 106916 |
| AOD_uncertainty | 50089 | 106916 |
| angstrom_exp_550_860 | 50089 | 106916 |
| AOD_absorption | 50089 | 106916 |
| AOD_nonspherical | 50089 | 106916 |
| small_mode_AOD | 50089 | 106916 |
| medium_mode_AOD | 50089 | 106916 |
| large_mode_AOD | 50089 | 106916 |
| aod_mix_01 | 57964 | 99041 |
| aod_mix_02 | 58151 | 98854 |
| aod_mix_03 | 58387 | 98618 |
| aod_mix_04 | 58675 | 98330 |
| aod_mix_05 | 58870 | 98135 |
| aod_mix_06 | 59079 | 97926 |
| aod_mix_07 | 59220 | 97785 |
| aod_mix_08 | 59105 | 97900 |
| aod_mix_09 | 58027 | 98978 |
| aod_mix_10 | 54411 | 102594 |
| aod_mix_11 | 63009 | 93996 |
| aod_mix_12 | 62969 | 94036 |
| aod_mix_13 | 62990 | 94015 |
| aod_mix_14 | 62899 | 94106 |
| aod_mix_15 | 62332 | 94673 |
| aod_mix_16 | 61316 | 95689 |
| aod_mix_17 | 59339 | 97666 |
| aod_mix_18 | 56133 | 100872 |
| aod_mix_19 | 51824 | 105181 |
| aod_mix_20 | 46472 | 110533 |
| aod_mix_21 | 49385 | 107620 |
| aod_mix_22 | 49028 | 107977 |
| aod_mix_23 | 48577 | 108428 |
| aod_mix_24 | 47605 | 109400 |
| aod_mix_25 | 46377 | 110628 |
| aod_mix_26 | 45070 | 111935 |
| aod_mix_27 | 43693 | 113312 |
| aod_mix_28 | 42072 | 114933 |
| aod_mix_29 | 40508 | 116497 |
| aod_mix_30 | 38868 | 118137 |
| aod_mix_31 | 61811 | 95194 |
| aod_mix_32 | 61716 | 95289 |
| aod_mix_33 | 61529 | 95476 |
| aod_mix_34 | 60895 | 96110 |
| aod_mix_35 | 60081 | 96924 |
| aod_mix_36 | 58316 | 98689 |
| aod_mix_37 | 55627 | 101378 |
| aod_mix_38 | 52266 | 104739 |
| aod_mix_39 | 48168 | 108837 |
| aod_mix_40 | 43772 | 113233 |
| aod_mix_41 | 53302 | 103703 |
| aod_mix_42 | 53203 | 103802 |
| aod_mix_43 | 53113 | 103892 |
| aod_mix_44 | 52600 | 104405 |
| aod_mix_45 | 51719 | 105286 |
| aod_mix_46 | 50327 | 106678 |
| aod_mix_47 | 48476 | 108529 |
| aod_mix_48 | 46027 | 110978 |
| aod_mix_49 | 43313 | 113692 |
| aod_mix_50 | 40558 | 116447 |
| aod_mix_51 | 60791 | 96214 |
| aod_mix_52 | 54792 | 102213 |
| aod_mix_53 | 38906 | 118099 |
| aod_mix_54 | 51407 | 105598 |
| aod_mix_55 | 43584 | 113421 |
| aod_mix_56 | 33043 | 123962 |
| aod_mix_57 | 38243 | 118762 |
| aod_mix_58 | 33797 | 123208 |
| aod_mix_59 | 29758 | 127247 |
| aod_mix_60 | 29972 | 127033 |
| aod_mix_61 | 29164 | 127841 |
| aod_mix_62 | 28486 | 128519 |
| aod_mix_63 | 38842 | 118163 |
| aod_mix_64 | 37758 | 119247 |
| aod_mix_65 | 36746 | 120259 |
| aod_mix_66 | 35872 | 121133 |
| aod_mix_67 | 29471 | 127534 |
| aod_mix_68 | 28945 | 128060 |
| aod_mix_69 | 28730 | 128275 |
| aod_mix_70 | 28666 | 128339 |
| aod_mix_71 | 28061 | 128944 |
| aod_mix_72 | 28061 | 128944 |
| aod_mix_73 | 28068 | 128937 |
| aod_mix_74 | 28088 | 128917 |
First, we notice that there are no missing values for the date variables (Year, Month, and Day) or for PM2.5, which is excellent, as these are arguably our most important variables.
We can also notice that each of the 8 AOD variables has the same number of recorded values (50089) and the same number of missing values (106916). If we examine these 8 variables further, we find that they are a “package deal”; for each observation, there is either a recorded value for all 8 of these variables, or a missing value for all 8 variables.
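The “package deal” property can be checked row by row. The sketch below assumes rows are dictionaries with `None` marking a missing value, and for brevity it uses only three of the AOD column names from the table above.

```python
def all_or_none(rows, columns):
    """True if every row has all of `columns` recorded or all of them missing."""
    for row in rows:
        missing = [row.get(c) is None for c in columns]
        if any(missing) and not all(missing):
            return False  # this row is only partially recorded
    return True

aod_columns = ["AOD", "AOD_uncertainty", "small_mode_AOD"]

# Hypothetical rows, invented for illustration.
rows = [
    {"AOD": 0.12, "AOD_uncertainty": 0.02, "small_mode_AOD": 0.05},
    {"AOD": None, "AOD_uncertainty": None, "small_mode_AOD": None},
]
partial = {"AOD": 0.30, "AOD_uncertainty": None, "small_mode_AOD": 0.10}

package_deal = all_or_none(rows, aod_columns)
with_partial_row = all_or_none(rows + [partial], aod_columns)
```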
Unfortunately, the same cannot be said for the 74 AOD mixture variables. From the table above, we can clearly see that the number of available observations varies across the 74 mixtures. However, even the mixtures with the fewest recorded observations (aod_mix_71 and aod_mix_72) each have 28061 recorded values. Furthermore, 20283 observations have a recorded value for every one of the 74 mixtures, which is a fair amount of data to work with.
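Both quantities discussed above, the per-column counts of recorded values and the number of rows that are complete across all mixtures, can be computed as in the sketch below. It again assumes dictionary rows with `None` for missing values, and it uses three mixture columns and invented values for brevity.

```python
def count_recorded(rows, columns):
    """Number of non-missing values in each column."""
    return {c: sum(row.get(c) is not None for row in rows) for c in columns}

def complete_cases(rows, columns):
    """Rows that have a recorded value in every one of `columns`."""
    return [row for row in rows if all(row.get(c) is not None for c in columns)]

mix_columns = ["aod_mix_01", "aod_mix_02", "aod_mix_03"]

# Hypothetical rows, invented for illustration.
rows = [
    {"aod_mix_01": 0.10, "aod_mix_02": 0.11, "aod_mix_03": 0.12},
    {"aod_mix_01": 0.20, "aod_mix_02": None, "aod_mix_03": 0.22},
    {"aod_mix_01": None, "aod_mix_02": None, "aod_mix_03": 0.32},
    {"aod_mix_01": 0.40, "aod_mix_02": 0.41, "aod_mix_03": 0.42},
]
recorded = count_recorded(rows, mix_columns)
n_complete = len(complete_cases(rows, mix_columns))
```

The complete-case count can never exceed the smallest per-column count, which is why the least-recorded mixtures bound how much data remains.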
Next, we will create some charts and plots of the matched MISR data, to get visual representations of the data which we have collected.
First, we will create a “correlation heatmap” to visually depict the correlations between the 74 AOD mixtures which were collected in the MISR data. In the correlation heatmap shown below, the correlation between each pair of mixtures is measured on a scale from -1 to 1, and each square in the heatmap is coloured in, with its colour and intensity determined by the correlation between the two variables.
Correlation Heatmap for the 74 MISR Mixtures
As we can clearly see in the correlation heatmap displayed above, the 74 AOD mixtures in the collected MISR data are all strongly correlated with one another, as the entire heatmap is green.
In fact, the weakest correlation between any pair of these 74 AOD mixtures is 0.681, between aod_mix_01 and aod_mix_44. Even this value is still considered a strong positive linear relationship between two variables.
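Finding the weakest pairwise correlation amounts to computing the Pearson correlation for every pair of columns and taking the minimum. The sketch below shows one way to do this, assuming complete columns of equal length; the function names and the toy values are invented for illustration and are not the MISR data.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def weakest_pair(columns):
    """Return (correlation, name_a, name_b) for the least correlated pair."""
    return min(
        ((pearson(columns[a], columns[b]), a, b) for a, b in combinations(columns, 2)),
        key=lambda item: item[0],
    )

# Toy columns, invented for illustration.
columns = {
    "aod_mix_01": [1.0, 2.0, 3.0, 4.0],
    "aod_mix_02": [2.0, 4.0, 6.0, 8.5],
    "aod_mix_03": [1.0, 3.0, 2.0, 4.0],
}
r, first, second = weakest_pair(columns)
```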
Next, we will test a variety of different model fitting techniques on our dataset in order to determine which models are generally more efficient and serve as better models to make predictions for our dataset.
We will create a whole host of different models, as we have multiple different values in these two datasets which we want to predict, and there are multiple different sets of predictors which we aim to incorporate.
The 6 main values which we want to predict are PM2.5, \(\text{SO}_{4}^{2-}\) (sulfate), \(\text{NO}_{3}^{-}\) (nitrate), dust mass, elemental carbon, and organic carbon. The two primary sets of predictors which we want to use are the 8 measured AOD values and the 74 MISR AOD mixtures.
In addition to these two sets of predictors mentioned above, we will also introduce a “Months” variable to help account for the changes in these values over time. The Months variable will be computed by determining how many months it has been since March 2000, which represents the beginning of our collected data.
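The Months variable can be derived from the separate year and month attributes. In the sketch below, March 2000 maps to 0; whether the count starts at 0 or 1 is not stated above, so this starting value is an assumption, as is the function name.

```python
def months_since_start(year, month, start_year=2000, start_month=3):
    """Whole months elapsed since the start month (March 2000 by default)."""
    return (year - start_year) * 12 + (month - start_month)

first_month = months_since_start(2000, 3)   # the start of the data
one_year_in = months_since_start(2001, 3)   # twelve months later
```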
First, we will remove all rows with missing values for these desired predictors, and then we will split both of our datasets into a training dataset, a validation dataset, and a test dataset, with a 70/15/15 split for the training, validation, and test datasets, respectively.
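The two steps above, dropping incomplete rows and then splitting 70/15/15, can be sketched as follows. The random shuffle, the fixed seed, and the rounding of the split sizes are assumptions; the text does not say how rows were assigned to each subset.

```python
import random

def drop_missing(rows, predictors):
    """Keep only rows with a recorded (non-None) value for every predictor."""
    return [row for row in rows if all(row.get(p) is not None for p in predictors)]

def train_val_test_split(rows, fractions=(0.70, 0.15, 0.15), seed=0):
    """Shuffle rows reproducibly and split them into train/validation/test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = round(fractions[0] * len(rows))
    n_val = round(fractions[1] * len(rows))
    train = rows[:n_train]
    validation = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]  # whatever remains goes to the test set
    return train, validation, test

# Hypothetical data: 100 complete rows plus 5 rows with a missing predictor.
data = [{"AOD": 0.1 + i / 1000, "PM25": float(i)} for i in range(100)]
data += [{"AOD": None, "PM25": 1.0} for _ in range(5)]

complete = drop_missing(data, ["AOD"])
train, validation, test = train_val_test_split(complete)
```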
| nrounds | eta | max_depth | gamma | colsample_bytree | min_child_weight | subsample | RMSE | R2 |
|---|---|---|---|---|---|---|---|---|
| 100 | 0.1 | 10 | 0.01 | 0.50 | 0 | 0.50 | 6.143790 | 0.5231206 |
| 100 | 0.3 | 10 | 0.01 | 0.50 | 0 | 0.50 | 5.892880 | 0.5606506 |
| 100 | 0.6 | 10 | 0.01 | 0.50 | 0 | 0.50 | 6.315855 | 0.5223606 |
| 100 | 1.0 | 10 | 0.01 | 0.50 | 0 | 0.50 | 7.803262 | 0.4116562 |
| 100 | 0.1 | 10 | 0.01 | 0.75 | 0 | 0.50 | 6.130884 | 0.5242569 |
| 100 | 0.3 | 10 | 0.01 | 0.75 | 0 | 0.50 | 5.764316 | 0.5794773 |
| 100 | 0.6 | 10 | 0.01 | 0.75 | 0 | 0.50 | 6.378498 | 0.5135815 |
| 100 | 1.0 | 10 | 0.01 | 0.75 | 0 | 0.50 | 8.204299 | 0.3825856 |
| 100 | 0.1 | 10 | 0.01 | 1.00 | 0 | 0.50 | 6.048014 | 0.5365660 |
| 100 | 0.3 | 10 | 0.01 | 1.00 | 0 | 0.50 | 5.911316 | 0.5594701 |
| 100 | 0.6 | 10 | 0.01 | 1.00 | 0 | 0.50 | 6.229027 | 0.5368191 |
| 100 | 1.0 | 10 | 0.01 | 1.00 | 0 | 0.50 | 8.326968 | 0.3675555 |
| 100 | 0.1 | 10 | 0.01 | 0.50 | 1 | 0.50 | 6.110199 | 0.5279643 |
| 100 | 0.3 | 10 | 0.01 | 0.50 | 1 | 0.50 | 6.023779 | 0.5432236 |
| 100 | 0.6 | 10 | 0.01 | 0.50 | 1 | 0.50 | 6.201006 | 0.5371082 |
| 100 | 1.0 | 10 | 0.01 | 0.50 | 1 | 0.50 | 7.932602 | 0.3896417 |
| 100 | 0.1 | 10 | 0.01 | 0.75 | 1 | 0.50 | 6.115028 | 0.5262989 |
| 100 | 0.3 | 10 | 0.01 | 0.75 | 1 | 0.50 | 5.976784 | 0.5498814 |
| 100 | 0.6 | 10 | 0.01 | 0.75 | 1 | 0.50 | 6.459489 | 0.5092546 |
| 100 | 1.0 | 10 | 0.01 | 0.75 | 1 | 0.50 | 8.371160 | 0.3448535 |
| 100 | 0.1 | 10 | 0.01 | 1.00 | 1 | 0.50 | 6.087237 | 0.5309108 |
| 100 | 0.3 | 10 | 0.01 | 1.00 | 1 | 0.50 | 5.873801 | 0.5650744 |
| 100 | 0.6 | 10 | 0.01 | 1.00 | 1 | 0.50 | 6.448164 | 0.5128317 |
| 100 | 1.0 | 10 | 0.01 | 1.00 | 1 | 0.50 | 8.138078 | 0.3796206 |
| 100 | 0.1 | 10 | 0.01 | 0.50 | 0 | 0.75 | 5.940582 | 0.5555160 |
| 100 | 0.3 | 10 | 0.01 | 0.50 | 0 | 0.75 | 5.788754 | 0.5758618 |
| 100 | 0.6 | 10 | 0.01 | 0.50 | 0 | 0.75 | 5.867522 | 0.5734686 |
| 100 | 1.0 | 10 | 0.01 | 0.50 | 0 | 0.75 | 6.762700 | 0.4869584 |
| 100 | 0.1 | 10 | 0.01 | 0.75 | 0 | 0.75 | 5.894807 | 0.5615305 |
| 100 | 0.3 | 10 | 0.01 | 0.75 | 0 | 0.75 | 5.706754 | 0.5870975 |
| 100 | 0.6 | 10 | 0.01 | 0.75 | 0 | 0.75 | 6.052811 | 0.5534343 |
| 100 | 1.0 | 10 | 0.01 | 0.75 | 0 | 0.75 | 6.901059 | 0.4772332 |
| 100 | 0.1 | 10 | 0.01 | 1.00 | 0 | 0.75 | 5.964500 | 0.5497997 |
| 100 | 0.3 | 10 | 0.01 | 1.00 | 0 | 0.75 | 5.809995 | 0.5727164 |
| 100 | 0.6 | 10 | 0.01 | 1.00 | 0 | 0.75 | 5.887941 | 0.5714642 |
| 100 | 1.0 | 10 | 0.01 | 1.00 | 0 | 0.75 | 6.637553 | 0.5049771 |
| 100 | 0.1 | 10 | 0.01 | 0.50 | 1 | 0.75 | 5.951101 | 0.5536658 |
| 100 | 0.3 | 10 | 0.01 | 0.50 | 1 | 0.75 | 5.735304 | 0.5834862 |
| 100 | 0.6 | 10 | 0.01 | 0.50 | 1 | 0.75 | 6.102493 | 0.5439630 |
| 100 | 1.0 | 10 | 0.01 | 0.50 | 1 | 0.75 | 6.315434 | 0.5472771 |
| 100 | 0.1 | 10 | 0.01 | 0.75 | 1 | 0.75 | 5.952589 | 0.5523002 |
| 100 | 0.3 | 10 | 0.01 | 0.75 | 1 | 0.75 | 5.878688 | 0.5634303 |
| 100 | 0.6 | 10 | 0.01 | 0.75 | 1 | 0.75 | 5.751518 | 0.5880406 |
| 100 | 1.0 | 10 | 0.01 | 0.75 | 1 | 0.75 | 6.677700 | 0.4969724 |
| 100 | 0.1 | 10 | 0.01 | 1.00 | 1 | 0.75 | 5.954298 | 0.5510470 |
| 100 | 0.3 | 10 | 0.01 | 1.00 | 1 | 0.75 | 5.819988 | 0.5714425 |
| 100 | 0.6 | 10 | 0.01 | 1.00 | 1 | 0.75 | 5.912874 | 0.5702486 |
| 100 | 1.0 | 10 | 0.01 | 1.00 | 1 | 0.75 | 6.862316 | 0.4783951 |
| 100 | 0.1 | 10 | 0.01 | 0.50 | 0 | 1.00 | 5.881235 | 0.5643552 |
| 100 | 0.3 | 10 | 0.01 | 0.50 | 0 | 1.00 | 5.582362 | 0.6052813 |
| 100 | 0.6 | 10 | 0.01 | 0.50 | 0 | 1.00 | 5.764459 | 0.5841651 |
| 100 | 1.0 | 10 | 0.01 | 0.50 | 0 | 1.00 | 6.437772 | 0.5213619 |
| 100 | 0.1 | 10 | 0.01 | 0.75 | 0 | 1.00 | 5.875418 | 0.5640211 |
| 100 | 0.3 | 10 | 0.01 | 0.75 | 0 | 1.00 | 5.663587 | 0.5932175 |
| 100 | 0.6 | 10 | 0.01 | 0.75 | 0 | 1.00 | 5.732368 | 0.5888061 |
| 100 | 1.0 | 10 | 0.01 | 0.75 | 0 | 1.00 | 6.347134 | 0.5374107 |
| 100 | 0.1 | 10 | 0.01 | 1.00 | 0 | 1.00 | 5.951434 | 0.5511222 |
| 100 | 0.3 | 10 | 0.01 | 1.00 | 0 | 1.00 | 5.792958 | 0.5755533 |
| 100 | 0.6 | 10 | 0.01 | 1.00 | 0 | 1.00 | 5.764137 | 0.5855169 |
| 100 | 1.0 | 10 | 0.01 | 1.00 | 0 | 1.00 | 6.178901 | 0.5500489 |
| 100 | 0.1 | 10 | 0.01 | 0.50 | 1 | 1.00 | 5.827251 | 0.5737919 |
| 100 | 0.3 | 10 | 0.01 | 0.50 | 1 | 1.00 | 5.559694 | 0.6080988 |
| 100 | 0.6 | 10 | 0.01 | 0.50 | 1 | 1.00 | 5.801874 | 0.5771998 |
| 100 | 1.0 | 10 | 0.01 | 0.50 | 1 | 1.00 | 6.383772 | 0.5185127 |
| 100 | 0.1 | 10 | 0.01 | 0.75 | 1 | 1.00 | 5.892501 | 0.5614501 |
| 100 | 0.3 | 10 | 0.01 | 0.75 | 1 | 1.00 | 5.587958 | 0.6041747 |
| 100 | 0.6 | 10 | 0.01 | 0.75 | 1 | 1.00 | 5.709844 | 0.5911838 |
| 100 | 1.0 | 10 | 0.01 | 0.75 | 1 | 1.00 | 6.357415 | 0.5306110 |
| 100 | 0.1 | 10 | 0.01 | 1.00 | 1 | 1.00 | 5.951434 | 0.5511222 |
| 100 | 0.3 | 10 | 0.01 | 1.00 | 1 | 1.00 | 5.792958 | 0.5755533 |
| 100 | 0.6 | 10 | 0.01 | 1.00 | 1 | 1.00 | 5.764137 | 0.5855169 |
| 100 | 1.0 | 10 | 0.01 | 1.00 | 1 | 1.00 | 6.178901 | 0.5500489 |
Of the 72 xgboost models which we fitted, as seen in the table above, the lowest RMSE was 5.559694, and the highest R2 value was 0.6080988.
Coincidentally, the xgboost model which attained the lowest RMSE was also the xgboost model which attained the highest R2 value. This model had the parameters nrounds = 100, eta = 0.3, max_depth = 10, gamma = 0.01, colsample_bytree = 0.50, min_child_weight = 1, and subsample = 1.
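The tuning grid in the table above can be reconstructed by crossing the values that vary (4 values of eta, 3 of colsample_bytree, 2 of min_child_weight, and 3 of subsample) while holding nrounds, max_depth, and gamma fixed. The sketch below only enumerates the grid and selects the best row; it does not fit any models, and the three result rows are copied from the table purely to demonstrate the selection step.

```python
from itertools import product

# Parameter values as they appear in the results table.
grid = {
    "eta": [0.1, 0.3, 0.6, 1.0],
    "colsample_bytree": [0.50, 0.75, 1.00],
    "min_child_weight": [0, 1],
    "subsample": [0.50, 0.75, 1.00],
}
fixed = {"nrounds": 100, "max_depth": 10, "gamma": 0.01}

# One dictionary per combination: 4 * 3 * 2 * 3 = 72 candidate models.
combos = [dict(zip(grid, values), **fixed) for values in product(*grid.values())]

# Three rows copied from the table above.
results = [
    {"eta": 0.3, "colsample_bytree": 0.50, "min_child_weight": 0, "subsample": 1.00,
     "RMSE": 5.582362, "R2": 0.6052813},
    {"eta": 0.3, "colsample_bytree": 0.50, "min_child_weight": 1, "subsample": 1.00,
     "RMSE": 5.559694, "R2": 0.6080988},
    {"eta": 0.3, "colsample_bytree": 0.75, "min_child_weight": 1, "subsample": 1.00,
     "RMSE": 5.587958, "R2": 0.6041747},
]
best_by_rmse = min(results, key=lambda r: r["RMSE"])
best_by_r2 = max(results, key=lambda r: r["R2"])
```

Selecting by RMSE and by R2 separately makes it easy to check whether the two criteria agree, as they do here.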